In [1]:
from msdas import *
%pylab inline
reload(annotations)


Couldn't import dot_parser, loading of dot files will not be possible.
Populating the interactive namespace from numpy and matplotlib
Out[1]:
<module 'msdas.annotations' from '/home/cokelaer/Work/github/msdas/src/msdas/annotations.pyc'>

Introduction

When reading an input file, the Entry and Entry_name columns may not be set at all. In addition, the full sequence and GO terms are not necessarily provided. The annotations module retrieves the UniProt entry names and all annotations.
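To make the gap concrete, here is a minimal sketch (not part of msdas) that checks a plain pandas DataFrame for the two columns named above. The helper `missing_annotation_columns` and the toy data are assumptions made for illustration only.

```python
import pandas as pd

# Column names taken from the text above; treating them as the complete set
# of required annotation columns is an assumption of this sketch.
ANNOTATION_COLUMNS = ["Entry", "Entry_name"]


def missing_annotation_columns(df, required=ANNOTATION_COLUMNS):
    """Return the required columns that are absent or entirely empty."""
    missing = []
    for name in required:
        if name not in df.columns or df[name].isnull().all():
            missing.append(name)
    return missing


# Toy data standing in for a freshly read input file (hypothetical values).
toy = pd.DataFrame({"Protein": ["DIG1", "DIG2"], "Psite": ["S142", "S225"]})
print(missing_annotation_columns(toy))
```

A non-empty result is the situation in which the annotations have to be fetched.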


In [2]:
filename = yeast.get_yeast_filenames()[0]
r = readers.MassSpecReader(filename)


INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/alpha0.csv
WARNING:root:Some Phospho strings found in Sequence column. No Sequence_Phospho column found.Renaming Sequence into Sequence_Phospho
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost

At this point, the dataframe held by the MassSpecReader contains the data and some metadata, but no annotation such as the UniProt entry. GO terms and UniProt IntAct information could also be retrieved from UniProt. The annotations module provides tools to fetch this kind of information automatically.

The input can be a filename or an existing MassSpecReader instance.


In [3]:
a = annotations.Annotations(r, "YEAST", verbose=True)


INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Entry column not found in the dataframe. call get_uniprot_entries
INFO:root:Initialising UniProt service (REST)

In [4]:
a.annotations  # empty for now

In [5]:
a._mapping # empty for now


Out[5]:
{}

In [6]:
a.get_uniprot_entries()   # requires a network connection; may take a few seconds


INFO:root:Fetching uniprot accession numbers for 57 entries
INFO:root:Fetching uniprot accession numbers for 23 unique entries
WARNING:root:deprecated in version 1.3.1. Use mapping instead
INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.pyc; Python 2.7.3; Linux) Python-requests/2.7.0
INFO:root:getUserAgent: End
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.uniprot.org

In [7]:
a._mapping


Out[7]:
{u'DIG1_YEAST': [u'Q03063'],
 u'DIG2_YEAST': [u'Q03373'],
 u'FAR1_YEAST': [u'P21268'],
 u'FPS1_YEAST': [u'P23900'],
 u'FUS3_YEAST': [u'P16892'],
 u'GPA1_YEAST': [u'P08539'],
 u'GPD1_YEAST': [u'Q00055'],
 u'HOG1_YEAST': [u'P32485'],
 u'HOT1_YEAST': [u'Q03213'],
 u'PBS2_YEAST': [u'P08018'],
 u'PTP2_YEAST': [u'P29461'],
 u'RCK2_YEAST': [u'P38623'],
 u'SIC1_YEAST': [u'P38634'],
 u'SKO1_YEAST': [u'Q02100'],
 u'SLN1_YEAST': [u'P39928'],
 u'SSK1_YEAST': [u'Q07084'],
 u'SSK2_YEAST': [u'P53599'],
 u'STE11_YEAST': [u'P23561'],
 u'STE12_YEAST': [u'P13574'],
 u'STE20_YEAST': [u'Q03497'],
 u'STE2_YEAST': [u'D6VTK4'],
 u'STE50_YEAST': [u'P25344'],
 u'TEC1_YEAST': [u'P18412']}

In [8]:
a.df[['Protein', 'Psite', 'Entry']].loc[0:10]  # label-based slice, end inclusive


Out[8]:
    Protein           Psite   Entry
0      DIG1       S126+S127  Q03063
1      DIG1            S142  Q03063
2      DIG1            S272  Q03063
3      DIG1       S272^S275  Q03063
4      DIG1  S272^T277^S279  Q03063
5      DIG1            S330  Q03063
6      DIG1            S395  Q03063
7      DIG2            S225  Q03373
8      DIG2             S84  Q03373
9      DIG2             T83  Q03373
10     FAR1            S114  P21268
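
The Entry column above can be read as the entry-name mapping joined onto the Protein names. The following is a minimal sketch of that idea using plain pandas. It is not the msdas implementation: the helper `add_entry_column`, the choice of the first accession when several are listed, and the toy dataframe are all assumptions.

```python
import pandas as pd

# A few entries copied from the mapping shown above (entry name -> accessions).
mapping = {
    "DIG1_YEAST": ["Q03063"],
    "DIG2_YEAST": ["Q03373"],
    "FAR1_YEAST": ["P21268"],
}


def add_entry_column(df, mapping, organism="YEAST"):
    """Return a copy of df with an Entry column derived from Protein names.

    Hypothetical helper: it takes the first accession listed for each
    entry name and leaves None where no mapping exists.
    """
    def lookup(protein):
        accessions = mapping.get("%s_%s" % (protein, organism), [])
        return accessions[0] if accessions else None

    out = df.copy()
    out["Entry"] = out["Protein"].map(lookup)
    return out


# Toy rows (hypothetical subset of the table above).
toy = pd.DataFrame({
    "Protein": ["DIG1", "DIG1", "DIG2", "FAR1"],
    "Psite": ["S142", "S272", "S225", "S114"],
})
annotated = add_entry_column(toy, mapping)
print(annotated)
```

Because the lookup is keyed on the entry name, each unique protein needs to be resolved only once, however many phosphosites it has.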

In [9]:
a.set_annotations()


INFO:root:Fectching information from uniprot. Takes some time
INFO:root:fetching information from uniprot for 23 entries
INFO:root:uniprot.get_df 1/1
WARNING:root:column could not be parsed. Protein families
WARNING:root:column could not be parsed. interactor
WARNING:root:column could not be parsed. Subcellular location
INFO:root:Fectching 23
INFO:root:Annotations have been loaded. You can save the annotations dataframe attribute using x.to_pickle('annotations.pkl')  Next time, you could just load if using 

     >>> m = readers.MassSpecReader(filename, mode='yeast')
     >>>  m.read_annotations('annotations.pkl')
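
The log message above suggests pickling the annotations so that later sessions can skip the network step. A minimal sketch of that load-or-build pattern with plain pandas follows. The names `load_or_build` and `fake_fetch` are hypothetical, and the stand-in data replaces the real UniProt query.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Load a pickled DataFrame from path, or build it and save it first."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build()
    df.to_pickle(path)
    return df


def fake_fetch():
    # Stand-in for the slow network step; values copied from the mapping above.
    return pd.DataFrame({
        "Entry": ["Q03063", "Q03373"],
        "Entry_name": ["DIG1_YEAST", "DIG2_YEAST"],
    })


# A temporary directory keeps the sketch from touching the working directory.
path = os.path.join(tempfile.mkdtemp(), "annotations.pkl")
first = load_or_build(path, fake_fetch)    # builds, then writes the pickle
second = load_or_build(path, fake_fetch)   # reads the pickle back
print(first.equals(second))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.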

In [10]:
a.df[['Protein', 'Psite', 'Entry']].loc[0:10]  # label-based slice, end inclusive


Out[10]:
    Protein           Psite   Entry
0      DIG1       S126+S127  Q03063
1      DIG1            S142  Q03063
2      DIG1            S272  Q03063
3      DIG1       S272^S275  Q03063
4      DIG1  S272^T277^S279  Q03063
5      DIG1            S330  Q03063
6      DIG1            S395  Q03063
7      DIG2            S225  Q03373
8      DIG2             S84  Q03373
9      DIG2             T83  Q03373
10     FAR1            S114  P21268